Skip to content

fix: stop copying root tool installs into sandbox home#181

Merged
xdotli merged 16 commits intobenchflow-ai:dev-0.3from
EYH0602:fix/sandbox-copy
Apr 25, 2026
Merged

fix: stop copying root tool installs into sandbox home#181
xdotli merged 16 commits intobenchflow-ai:dev-0.3from
EYH0602:fix/sandbox-copy

Conversation

@EYH0602
Copy link
Copy Markdown
Contributor

@EYH0602 EYH0602 commented Apr 22, 2026

Summary

Closes #178.

  • Shared install prefixes (option 2): setup_sandbox_user() no longer recursively copies /root/.nvm, /root/.local/bin, or heavyweight tool trees into the sandbox home. Agent binaries are expected to live on shared prefixes like /usr/local/bin or /opt.
  • Symlink fallback (option 1): legacy images that still install tools under /root get a narrow ln -s instead of a full cp -a, keeping setup instant.
  • Skills linked, not copied: deploy_skills() now symlinks the shared skills tree into agent discovery paths (ln -sfn) instead of cp -r.
  • Registry contract tightened: get_sandbox_home_dirs() returns only config/auth dirs, not tool install dirs. New registry invariant test enforces that install_cmd targets shared prefixes.

Test plan

  • pytest tests/test_sandbox_setup.py — contract tests for the generated setup command
  • pytest tests/test_agent_setup.py — skill deployment uses symlinks, not copies
  • pytest tests/test_registry_invariants.py — all agents install to shared prefixes
  • pytest tests/test_sandbox.py — home dir derivation matches new semantics

Open in Devin Review

xdotli and others added 15 commits April 21, 2026 16:38
release: benchflow 0.3.0 — Scene lifecycle, multi-agent, CLI redesign
The oracle agent runs solution/solve.sh and never calls an LLM, but
resolve_agent_env() was validating API keys for whatever model the CLI
defaulted to (claude-haiku-4-5-20251001). This made `bench eval create
-a oracle` fail without ANTHROPIC_API_KEY set, even though oracle
doesn't need it.
Move the fix from resolve_agent_env to the CLI layer: oracle runs
solve.sh and never calls an LLM, so it should not receive DEFAULT_MODEL
at all. Both _run_single and _run_batch now pass model=None for oracle.
Widen JobConfig.model to str | None to support this.
…key-check

fix: skip model/API-key validation for oracle agent
PR benchflow-ai#173 moved the oracle/DEFAULT_MODEL guard from resolve_agent_env to
cli/eval.py, but cli/eval.py is orphaned (never imported into the live
CLI), so `bench eval create` still passes DEFAULT_MODEL to oracle and
trips ANTHROPIC_API_KEY validation. Three changes:

- Restore the `agent != "oracle"` guard in resolve_agent_env so the
  chokepoint defends against any caller that forwards a model.
- Delete the orphan cli/eval.py and its tests — the live eval_create
  lives in cli/main.py and was the actual code path users hit.
- Add effective_model(agent, model) helper, change JobConfig.model
  default to None, replace seven `model or DEFAULT_MODEL` sites in
  cli/main.py and job.py YAML loaders so oracle gets honest model=None
  end-to-end (in result/summary JSON, prints, and downstream Trial).

Regression test in test_resolve_env_helpers.py pins the chokepoint by
calling resolve_agent_env("oracle", DEFAULT_MODEL, {}) with no API key
and no host auth — verified to fail on main with the user-facing
ANTHROPIC_API_KEY error and pass after the fix.
Bundle 14 tests in tests/test_oracle_chokepoint.py that pin each layer
of the prior fix at the right altitude:

- TestOrphanRemoval — cli/eval.py is gone (ModuleNotFoundError) and no
  src/ file references benchflow.cli.eval, guarding against a future
  re-introduction that could swallow the next bug fix the same way.
- TestEvalCreateRouting — `bench eval create` callback lives in
  cli/main.py:eval_create. Pins the architectural fact PR benchflow-ai#173 missed.
- TestEffectiveModel — unit tests for the helper: oracle drops model,
  non-oracle falls back to DEFAULT_MODEL, empty string treated as unset.
- TestOracleYamlLoaders — Job.from_yaml(oracle config) → model is None
  for both native and Harbor formats; non-oracle backwards-compat
  preserved.
- TestEvalCreateOracleCLI — end-to-end: live eval_create(agent="oracle")
  with no API key in env does not raise. Mocks Trial.create and resets
  the asyncio loop after to avoid polluting pre-existing tests that use
  the deprecated asyncio.get_event_loop() pattern.

Verified to fail on main in the right shape: 9 of 14 fail (each pinning
a deleted/added behavior), 5 pass (asserting structural facts already
true). The CLI test fails on main with the user-reported error
"ANTHROPIC_API_KEY required for model 'claude-haiku-4-5-20251001'…".
The previous commit deleted cli/eval.py and its tests as orphans, but
they are intentionally kept. Restore both from main, update eval.py to
use the effective_model() helper for the oracle chokepoint fix, and
replace the "module is gone" regression test with a guard that cli/main.py
does not import cli/eval (the actual invariant).
…t-and-cleanup

fix: oracle chokepoint guard + effective_model helper
Plan step 1/6: Lock the new sandbox contract in tests
Plan step 2/6: Narrow setup_sandbox_user() to user state only
Plan step 3/6: Align registry semantics with the new contract
Replace per-trial skill-tree copies with ln -sfn into a shared /skills (or
task skills_dir) root, drop skill_paths from get_sandbox_home_dirs(), and
add registry + sandbox-setup invariants that keep agent binaries on
/usr/local/* rather than /root-only home paths. Updates task-authoring
and api-reference docs to describe the new lightweight sandbox contract.
EYH0602 added a commit to EYH0602/benchflow that referenced this pull request Apr 23, 2026
@xdotli xdotli changed the base branch from main to dev-0.3 April 25, 2026 07:51
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 potential issue.

View 6 additional findings in Devin Review.

Open in Devin Review

Comment on lines 135 to 136
else:
logger.info("Skills already injected via Dockerfile")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 deploy_skills uses wrong source path when skills are already injected via Dockerfile

When skills_dir is provided AND the Dockerfile already contains COPY _deps/skills /skills/ (the already_injected path at src/benchflow/_agent_setup.py:126), effective_skills stays set to task_skills_dir (line 117) instead of being updated to "/skills". In the old code, effective_skills = "/skills" if skills_dir else task_skills_dir unconditionally set it to "/skills" whenever skills_dir was truthy, which correctly covered this case.

If task_skills_dir is None (common when the task doesn't declare environment.skills_dir), the condition at line 139 (if effective_skills and ...) is False, so no skill linking happens at all — skills are silently not distributed to agent discovery paths despite being present at /skills in the container.

Suggested change
else:
logger.info("Skills already injected via Dockerfile")
else:
logger.info("Skills already injected via Dockerfile")
effective_skills = "/skills"
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

xdotli added a commit that referenced this pull request Apr 25, 2026
Brings 126 ruff errors → 0 so CI's lint check goes green and unblocks
the 5 PRs targeting dev-0.3 (#176, #180, #181, #182, #191) that were
landing on top of pre-existing repo lint debt.

What changed:
1. Auto-fixes via `ruff check --fix --unsafe-fixes`:
   - 40 F401 unused-imports across src/, tests/, examples/
   - 8 I001 unsorted-imports
   - 6 UP037 quoted-annotations modernized
   - Other auto-fixable rules

2. Hand fixes:
   - src/benchflow/__init__.py: removed `Trial` from the `from harbor`
     re-export block (it was shadowed by `from benchflow.trial import Trial`
     at line 65, which is the canonical public Trial). Added
     `trial_config_from_yaml` to __all__.
   - src/benchflow/process.py: 3x `raise ConnectionError(...) from e` for
     B904 (errors raised inside except clauses).
   - src/benchflow/mcp/reviewer_server.py: same B904 fix for fastmcp
     ImportError reraise.
   - tests/test_skill_eval.py: raw string for `pytest.raises(match=...)`
     pattern (RUF043).
   - 3 files: replaced `×` (Unicode multiplication sign) in comments and
     f-strings with `x` (latin x) to clear RUF001/RUF003.

3. Per-file ignores added to pyproject.toml `[tool.ruff.lint.per-file-ignores]`:
   - `experiments/*.py` and `tests/conformance/*.py` ignore E402 — these
     are standalone scripts that legitimately set sys.path before importing.
   - `src/benchflow/runtime.py` ignores F821 — uses forward references
     resolved by `from __future__ import annotations`; explicit
     TYPE_CHECKING imports would force eager loads.

No code behavior changes. 580 tests pass; the 8 pre-existing failures
(env-leak between subscription auth tests, Docker compose env, judge
model default mismatch) are unrelated to this PR.
# Conflicts:
#	src/benchflow/_agent_env.py
#	src/benchflow/cli/eval.py
#	tests/test_agent_setup.py
#	tests/test_oracle_chokepoint.py
@xdotli xdotli merged commit e66537c into benchflow-ai:dev-0.3 Apr 25, 2026
1 check passed
@EYH0602 EYH0602 deleted the fix/sandbox-copy branch April 25, 2026 20:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

setup_sandbox_user copies hundreds of MB per trial — use symlinks or shared paths instead

2 participants